Character-based LSTM

Grab all the Chesterton texts from NLTK's Gutenberg corpus


In [1]:
from nltk.corpus import gutenberg

gutenberg.fileids()


Out[1]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [2]:
text = ''

# concatenate the three Chesterton books, lowercased
for txt in gutenberg.fileids():
    if 'chesterton' in txt:
        text += gutenberg.raw(txt).lower()

# character <-> index lookup tables over the sorted vocabulary
chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
'corpus length: {}  total chars: {}'.format(len(text), len(chars))


Out[2]:
'corpus length: 1184604  total chars: 65'
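
The 65 distinct characters cover lowercase letters, digits, punctuation, and whitespace. A quick peek at the two lookup tables (an added check, not in the original notebook):

print(chars[:10])           # first few characters in sorted order
print(char_indices['a'])    # character -> integer index
print(indices_char[0])      # integer index -> character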

In [3]:
print(text[:100])


[the ball and the cross by g.k. chesterton 1909]


i. a discussion somewhat in the air

the flying s

Create the Training Set

Build the training set. Take 40 characters and save the 41st character as the target: we will teach the model that a given 40-character sequence should be followed by that 41st character. Use a step size of 3 so consecutive windows overlap and we get many more 40/41 samples; with a corpus of 1,184,604 characters this yields (1184604 - 40) / 3 ≈ 394,855 windows.
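
Before the real data, here is the same windowing on a made-up string (a toy illustration, not from the original notebook):

text_demo = 'the quick brown fox'
maxlen_demo, step_demo = 8, 3
windows = [(text_demo[i: i + maxlen_demo], text_demo[i + maxlen_demo])
           for i in range(0, len(text_demo) - maxlen_demo, step_demo)]
print(windows)
# [('the quic', 'k'), (' quick b', 'r'), ('ick brow', 'n'), (' brown f', 'o')]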


In [4]:
maxlen = 40
step = 3
sentences = []
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i+maxlen])
    next_chars.append(text[i + maxlen])
    
print("sequences: ", len(sentences))


sequences:  394855

In [5]:
print(sentences[0])
print(sentences[1])


[the ball and the cross by g.k. chestert
e ball and the cross by g.k. chesterton 

In [6]:
print(next_chars[0])


o

This is the character immediately following the first window, continuing "chestert".

One-hot encode


In [7]:
import numpy as np

# X holds the one-hot encoded input windows: (n_sequences, maxlen, n_chars)
# y holds the one-hot encoded target character: (n_sequences, n_chars)
# plain bool is used here; np.bool is deprecated in newer NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
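
As a sanity check on the encoding (an added snippet, not in the original notebook), the first row of X can be decoded back into text with indices_char:

# argmax over the one-hot axis recovers the character indices
decoded = ''.join(indices_char[j] for j in X[0].argmax(axis=1))
print(decoded == sentences[0])  # True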

Create the Model


In [8]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop

model = Sequential()
# a single LSTM layer reads the 40 x 65 one-hot window; a dense softmax
# layer then scores the 65 possible next characters
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
# lr= is the older Keras spelling; recent versions call it learning_rate=
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()


Using TensorFlow backend.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 128)               99328     
_________________________________________________________________
dense_1 (Dense)              (None, 65)                8385      
_________________________________________________________________
activation_1 (Activation)    (None, 65)                0         
=================================================================
Total params: 107,713
Trainable params: 107,713
Non-trainable params: 0
_________________________________________________________________
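
The parameter counts can be checked by hand: an LSTM layer has four gates, each with input, recurrent, and bias weights, for 4 * units * (input_dim + units + 1) parameters, and a dense layer has inputs * outputs + outputs. A quick verification (added here, not in the original notebook):

lstm_params = 4 * 128 * (65 + 128 + 1)   # 99,328
dense_params = 128 * 65 + 65             # 8,385
print(lstm_params + dense_params)        # 107,713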

Train the Model


In [9]:
epochs = 2
batch_size = 128

model.fit(X, y, batch_size=batch_size, epochs=epochs)


Epoch 1/2
394855/394855 [==============================] - 93s 235us/step - loss: 1.8309
Epoch 2/2
394855/394855 [==============================] - 92s 233us/step - loss: 1.5709
Out[9]:
<keras.callbacks.callbacks.History at 0x1abdd0bafd0>
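
Even two epochs take a few minutes, so it can be worth persisting the trained model. A minimal sketch (the filename is arbitrary, not from the original notebook):

# saves architecture, weights, and optimizer state in one file
model.save('chesterton_lstm.h5')

# later, restore without retraining
from keras.models import load_model
model = load_model('chesterton_lstm.h5')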

Generate new sequences


In [10]:
import random

def sample(preds, temperature=1.0):
    # reweight the softmax output: temperatures below 1 sharpen the
    # distribution (safer choices), above 1 flatten it (more surprises)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw a single character index from the reweighted distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
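
To see what the temperature does, apply the same reweighting to a made-up three-way distribution (illustrative values only, not from the notebook):

probs = np.array([0.5, 0.3, 0.2])
for t in [0.2, 1.0, 2.0]:
    p = np.exp(np.log(probs) / t)
    print(t, np.round(p / p.sum(), 3))
# 0.2 [0.919 0.071 0.009]  <- sharper, more conservative
# 1.0 [0.5   0.3   0.2  ]  <- unchanged
# 2.0 [0.415 0.322 0.263]  <- flatter, more surprising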

In [11]:
import sys

# pick a random 40-character seed, then generate 400 characters at three
# temperatures, sliding the window forward one character at a time
start_index = random.randint(0, len(text) - maxlen - 1)
for diversity in [0.2, 0.5, 1.0]:
    print()
    print('----- diversity:', diversity)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    for i in range(400):
        # one-hot encode the current window
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]
        generated += next_char
        # drop the oldest character and append the prediction
        sentence = sentence[1:] + next_char
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


----- diversity: 0.2
----- Generating with seed: "head and features.  but though she was n"
head and features.  but though she was not believes as the still of the stood and the street of the stand of the stood and the stand of the stand of the stand of the stood and the strong face was the stare and the most contraling that the concertual and the little and the street and the little of the stand of the sense of the street of the street of the distance of the street of the stand of the still was a stand and the street of the s

----- diversity: 0.5
----- Generating with seed: "head and features.  but though she was n"
head and features.  but though she was not between the wall of the other ampered him was before asced and a sick off a respectains of the wild and strong concession alfeg and a sation of the tried that still that i was a cripted life that it was a lipted of the montton of his dreaming that in him.  it was a monght as a man was sort of the seconds and began of the distract of the colours of the solot to he stranged to stand the state of 

----- diversity: 1.0
----- Generating with seed: "head and features.  but though she was n"
head and features.  but though she was now, there was the withont living clother.  but oscro?"
cansiincle the long rush packen melony only off its be. which that in the french asking to his groveter.  i have pree up
rewoutd him oy which was i took drush that creatcable long fillag alsolted himself add our side and not poledd."

"that i am yro?"

"i thoudd it neteled from fam unhabled flams.--heard to be throwed kneepy,
so a miny mind's 
